Abstract

Data missing not at random (MNAR) is a major challenge in survey sampling. We propose an approach based on registry data to deal with non-ignorable missingness in health examination surveys. The approach relies on follow-up data available from administrative registers several years after the survey. For illustration we use data on smoking prevalence in the Finnish National FINRISK study conducted in 1972-1997. The data consist of measured survey information including missingness indicators, register-based background information and register-based time-to-disease survival data. The parameters of the missingness mechanism are estimable with these data although the original survey data are MNAR. The underlying data generation process is modelled by a Bayesian model. The results indicate that the estimated smoking prevalence rates in Finland may be significantly affected by missing data.

1 Introduction 

Participation rates in health examination surveys (HES) have been declining over the years in many countries. The declining participation rates affect the estimation of health indicators in many ways. First, the low participation rates compromise the population representativeness of the sample because the participants and non-participants differ from each other. The non-participants are more often smokers (Shahar et al., 1996; Tolonen et al., 2005) and have higher risk of death (Jousilahti et al., 2005; Harald et al., 2007) compared to the participants. It has also been found that the non-participants tend to be [...] (Sogaard et al., 2004), younger persons (Sogaard et al., 2004; Tolonen et al., 2005). Generally, the non-participants have been found to have lower socio-economic status (Jackson et al., 1996; van Loon et al., 2003; Drivsholm et al., 2006; Harald et al., 2007) and lower education (Shahar et al., 1996; Sogaard et al., 2004; Tolonen et al., 2005) than the participants. Second, the declining trends in participation rates may distort the trends of the estimated health indicators. In particular, if smokers, heavy alcohol users and obese persons are less eager to participate than they were decades ago, the trends of the health indicators may look more positive than they should.

In statistical terms, data from HES are missing not at random (MNAR) and consequently the missingness mechanism cannot be ignored in the analysis (Little & Rubin, 2002). Although dealing with non-ignorable missingness is challenging in general, there are some methods for this. One of these is making functional assumptions for the joint distribution of missing data and observed values (Little, 1993; Ekholm & Skinner, 1998). This is usually accompanied with a sensitivity analysis for evaluating the effect of the assumed missingness mechanism (van Buuren et al., 1999). If the study design is longitudinal, the modelling of non-ignorable missingness may be based on partially available repeated measurements (Ibrahim & Molenberghs, 2009). Recently, a subsample ignorable likelihood (SIL) approach (Little & Zhang, 2011) was proposed for situations where full data are available for some variables while the other variables have missing data.

We propose an approach to correct for non-ignorable missingness in situations where 
follow-up data are available for both participants and non-participants. Finland is one 
of the few countries where follow-up data for the entire survey sample can be obtained 
through a record linkage to the administrative registers. Naturally, the follow-up data 
will not be available right after the survey but only many years later. Without further assumptions, the trends of health indicators can therefore be corrected only retrospectively.

As an illustration of our approach, we use the data from the National FINRISK studies (Laatikainen et al., 2003; Harald et al., 2007), which are one of the data sources used to evaluate public health in Finland. The data from the surveys carried out in 1972, 1977, 1982, 1987, 1992 and 1997 are included. Participation in the physical measurements decreased from 95% in 1972 to 74% in 1997. Note that in the next section we define participation differently. Under the decreasing participation, we estimate the prevalence of smoking utilizing the follow-up data available from the registers.

The relevant details of the FINRISK surveys are presented in Section 2. In Section 3, a Bayesian model is built for the analysis of non-ignorable missing data. Section 4 compares the trends for non-ignorable and ignorable approaches, and Section 5 concludes the paper.


2 FINRISK data and linked register data 

The National FINRISK Study (earlier North Karelia Project) data arose from a setup where the original aim was an intervention among the people of North Karelia via a health education campaign. Later, the data have been collected every five years to measure the risk factors of key diseases and to monitor public health. In addition to North Karelia, the neighbouring province of Northern Savonia has been included in the studies since the beginning. Later, the Turku and Loimaa area, the Helsinki and Vantaa area and Oulu province have joined the survey. The data from the surveys conducted in 1972-1997 are used in this paper.

The sampling frame for the surveys has been the National Population Register. The survey design has changed over the study years, see Table 1, but in each study the sampling has been stratified among the participating areas. In 1972, the sampling was systematic on birthdays, and people aged 25-59 were sampled. In 1977, a simple random sample was drawn from people aged 30-64. In 1982, the survey was balanced between the 10-year age groups and people aged 25-64 were sampled. In 1987, 1992 and 1997 the sampling design was balanced sampling between 10-year age groups within genders. In 1997, the eligible age range was extended to 25-74 years in North Karelia and in the Helsinki and Vantaa area.

Participation is defined as answering the question about daily smoking. This definition leads to lower participation rates than reported elsewhere because some individuals participated otherwise but skipped the smoking questions. Participation seems to depend on age and gender, but possibly also on smoking, which is to be investigated. The age-dependency of the participation rate and its change over the period 1972-1997 is shown in Figure 1. Smoking, together with other health indicators, was measured by using a multi-page questionnaire. The smoking questions classified each person as a non-smoker, an ex-smoker or a current smoker. We model smoking using two classes, where the ex-smokers and non-smokers are treated as one class.


[Figure 1 plot: two panels, Men and Women; horizontal axis: age.]

Figure 1: Participation rate as a function of age in 1972 and 1997. Each circle and triangle represents the observed proportion of participants within a one-year age group over all areas studied that year. The graph shows that the participation rates have decreased in all age groups for both men and women. The solid lines are calculated using Locally Weighted Regression (LOESS) (Cleveland, 1979).


The sources of the follow-up data are the Care Register for Health Care (HILMO) (National Institute for Health and Welfare, 2014) and the cause of death data (Statistics Finland, 2014). The follow-up data are linked to the survey data by personal identification number. The follow-up data contain the date and the cause of death, and the cause of hospitalization. The diseases considered here are lung cancer (ICD10: C34, ICD9/ICD8: 162) and chronic obstructive pulmonary disease (COPD) (ICD10: J41-J44, ICD9/ICD8: [...] Wynder & Hoffmann, 1994; Mannino & Buist, 2007; Cornfield et al., 2009). The follow-up data are available for all samples. The effect of smoking on the onset of lung cancer and COPD for men and women is illustrated in Figure 2.


[Figure 2 plot: two panels, Men and Women; horizontal axis: age.]

Figure 2: Cumulative hazard estimates with confidence intervals for the smoking-related diseases lung cancer and COPD. Graphs are produced using the participant data only.


We denote our variables as follows. The smoking indicator variable is denoted as $Y_i$ for person $i$. Background variables $X_i = (x_{1i}, x_{2i}, x_{3i}, x_{4i})$ for person $i$ include the age at the beginning of the follow-up $x_{1i}$, area $x_{2i}$ and gender $x_{3i}$, which originate from the registers. The variable $x_{4i}$ is the study year.

The sample indicator $m_{1i} = 1$ indicates that person $i$ has been chosen to a survey sample, and the participation indicator $M_{2i} = 1$ indicates that he or she has participated in the survey. If $m_{1i} = 0$ then $M_{2i}$ must also be 0 because people outside of the survey sample cannot take part. Variables $X_i$ are observed for both the participants and non-participants while $Y_i$ is observed only for the participants.

The follow-up data consist of the time-to-event variable $T_i$ and event indicator $r_i$, where $T_i$ is the age at the diagnosis of the disease, i.e. the onset of lung cancer or COPD. Variable $T_i$ is observed for the participants and non-participants. If a person has not been diagnosed by the end of the follow-up period (31 December 2011), or if the person dies from other causes, then the time-to-event variable becomes right censored. In the case of right censoring we know only that $T_i > C_i$, where $C_i$ is the person's age at censoring or age at death. The date of diagnosis can be the same as the date of death if the person has not been diagnosed earlier and lung cancer or COPD is the cause of death. If a person recovers from lung disease and is diagnosed repeatedly, the time-to-event variable holds the time of the earliest diagnosis.


3 Bayesian model for non-participation and smoking 

3.1 Dependency structure and modelling assumptions 

We present the structure of the model in Figure 3 using the concept of causal model with design (Karvanen, 2014). The figure represents a causal model at the bottom, where background variables $X_i = (x_{1i}, x_{2i}, x_{3i}, x_{4i})$ affect the probability of smoking $P(Y_i)$ and the risk of lung disease $P(T_i)$. In addition, smoking also has an effect on the risk of lung disease. These relations are described as arrows $X_i \to T_i$, $X_i \to Y_i$ and $Y_i \to T_i$ in the model graph. The causal relations of smoking and lung cancer (Doll & Hill, 1956; Wynder & Hoffmann, 1994; Cornfield et al., 2009) and smoking and COPD (Mannino & Buist, 2007) are known to exist. Also, it has been observed that the prevalence of smoking varies depending on the area, gender and age (Peltonen et al., 2008; Borodulin et al., 2013).


Persons belonging to the sample have $m_{1i} = 1$, and are selected from population $\Omega$, which in this case is the general Finnish population in the geographically defined areas and age groups specified above. Sampling is based on the background register data, which is why we have $X_i \to m_{1i}$ in Figure 3. Participation, which is indicated by $M_{2i} = 1$, is affected by background variables ($X_i \to M_{2i}$) and smoking ($Y_i \to M_{2i}$). People may participate only if they are selected to the sample, which is indicated by the arrow $m_{1i} \to M_{2i}$ in the graph. If a person participates, he or she has $M_{2i} = 1$, and thus $Y_i^* = Y_i$. Otherwise, the smoking indicator is missing, $Y_i^* = \mathrm{NA}$. The background information as well as the survival information $T_i^*$ are collected for all persons in the sample. The follow-up variable $T_i$ is a vector of two elements, the actual time variable $t_i$, either for the event time or censoring time, and an event indicator, denoted as $r_i$. The notation for this is

$$T_i = (t_i, r_i) = \begin{cases} (t_i, 1), & \text{if an event is observed} \\ (t_i, 0), & \text{if an event is right censored.} \end{cases}$$

The observed $T_i^*$ is then defined as

$$T_i^* = \begin{cases} T_i, & \text{if person } i \text{ belongs to a sample: } m_{1i} = 1 \\ \mathrm{NA}, & \text{if person } i \text{ does not belong to a sample: } m_{1i} = 0. \end{cases}$$

The censoring due to deaths other than lung cancer or COPD is informative because smoking is a risk factor for many common causes of death. The usual way to deal with this kind of informative censoring is to define an additional endpoint for other deaths and use a competing risk model (Kalbfleisch & Prentice, 2002). However, this would create new problems because we would implicitly assume that all differences in the mortality between participants and non-participants are due to smoking. In reality, participants and non-participants differ also by many other risk factors which work as confounders. Therefore, we have chosen to use only the smoking-specific survival outcome in the analysis and to treat the censoring as non-informative. The implications for the results and alternative approaches are discussed in Section 5.

In Figure 3, the non-participation depends on smoking status $Y_i$, which means that the missingness mechanism is non-ignorable. In general, the non-ignorable missingness mechanism is not estimable from data. To overcome this issue, we use the follow-up data to make an additional assumption on the missingness mechanism.

[Figure 3 graph. Levels, from top to bottom: measurement of survival for disease; measurement for smoking indicator; measurement of background information; non-response; sampling of individuals; population. Nodes: $X_i$ background information (age at study, area, gender, year); $Y_i$ smoking status; $T_i$ disease time-to-event status; $X_i^*$ measured register-based background information; $Y_i^*$ measured (interview) smoking status; $T_i^*$ measured time-to-event disease outcome; $m_{0i}$ population indicator; $m_{1i}$ sampling indicator; $M_{2i}$ participation indicator.]

Figure 3: Illustration of variable dependencies and the data-collection process.

We want to estimate the smoking prevalence for the whole sample, so we need to estimate the distributions

$$P(Y_i) = P(M_{2i} = 1)\,P(Y_i \mid M_{2i} = 1) + P(M_{2i} = 0)\,P(Y_i \mid M_{2i} = 0), \quad i \in \Omega. \tag{1}$$

On the right hand side of Equation (1), the probability of smoking for non-participants $P(Y_i \mid M_{2i} = 0)$ cannot be estimated using the observed data without making further assumptions. This may be written as

$$P(Y_i \mid M_{2i} = 0) = \iint P(Y_i \mid T_i, X_i, M_{2i} = 0)\,P(T_i, X_i \mid M_{2i} = 0)\,dX_i\,dT_i, \quad i \in \Omega,$$

where $P(Y_i \mid T_i, X_i, M_{2i} = 0)$ is not estimable but $P(T_i, X_i \mid M_{2i} = 0)$ is estimable from observed data. We now assume that

$$P(Y_i \mid T_i, X_i, M_{2i} = 0) = P(Y_i \mid T_i, X_i, M_{2i} = 1), \quad i \in \Omega, \tag{2}$$

which means that, given the observations $T_i$ and $X_i$, the additional observation $M_{2i} = 1$ or $M_{2i} = 0$ does not give us any further understanding about the distribution of $Y_i$. Thus, for the rest of our paper, we restrict the models of interest to the cases for which Equation (2) holds. Now the smoking prevalence in Equation (1) can be estimated if the probabilities $P(M_{2i} = 1)$, $P(Y_i \mid M_{2i} = 1)$, $P(M_{2i} = 0)$, $P(T_i, X_i \mid M_{2i} = 0)$ and $P(Y_i \mid T_i, X_i, M_{2i} = 1)$ can be estimated. The assumption (2) can be justified if the relation $Y_i \to T_i$ is strong, i.e. the early onset of lung cancer or COPD is a strong indicator of smoking. In practice, the model parameters for the relations of $X_i$, $Y_i$ and $T_i$ are estimated using data from the participants only.
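The identification argument can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the pair $(T_i, X_i)$ has been collapsed into two discrete cells, so that the integral becomes a sum; the function name, the cell labels and all probabilities are hypothetical and are not estimates from the FINRISK data.

```python
def smoking_prevalence(p_part, p_y_given_part, p_cell_nonpart, p_y_given_cell_part):
    """Combine Equations (1) and (2) for discrete (T, X) cells.

    p_part: P(M2 = 1)
    p_y_given_part: P(Y = 1 | M2 = 1)
    p_cell_nonpart: dict cell -> P((T, X) in cell | M2 = 0), summing to 1
    p_y_given_cell_part: dict cell -> P(Y = 1 | cell, M2 = 1)
    """
    # Assumption (2): P(Y | T, X, M2 = 0) equals P(Y | T, X, M2 = 1), so the
    # participants' conditional probabilities are averaged over the
    # non-participants' distribution of (T, X).
    p_y_given_nonpart = sum(
        p_y_given_cell_part[cell] * weight
        for cell, weight in p_cell_nonpart.items()
    )
    # Equation (1): mixture over participants and non-participants.
    return p_part * p_y_given_part + (1.0 - p_part) * p_y_given_nonpart


# Hypothetical toy numbers: non-participants have more early disease onsets.
prevalence = smoking_prevalence(
    p_part=0.75,
    p_y_given_part=0.23,
    p_cell_nonpart={"early_onset": 0.1, "no_early_onset": 0.9},
    p_y_given_cell_part={"early_onset": 0.8, "no_early_onset": 0.2},
)
print(round(prevalence, 4))
```

In this toy setting the corrected prevalence exceeds the participants' prevalence only because the non-participants are given a larger share of early disease onsets; with identical cell distributions the two would coincide.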

3.2 Construction of posterior distribution 

The model consists of two sub-models: a survival model for $T_i$, and a logistic regression model for the smoking indicator $Y_i$. Next, the parametric forms for the sub-models are considered.

The time-to-disease variable $T_i \mid m_{1i} = 1$ is assumed to follow a Weibull distribution with a common shape parameter $a$ and a scale parameter $b_i$ varying person by person. The distribution is left-truncated by the person's age $t_{0i} = x_{1i}$ at the beginning of follow-up. The likelihood contribution for observed disease cases can be written as

$$p(T = t_{1i} \mid a, b_i, r_i = 1, T > t_{0i}) = \frac{a\,b_i\,t_{1i}^{a-1} \exp(-b_i t_{1i}^{a})}{1 - F(t_{0i})} \quad \text{for } t_{1i} > t_{0i},$$

where $F(t)$ is the cumulative distribution function of the Weibull distribution. For censored cases $i: r_i = 0$ the likelihood contribution is the survival function

$$S(T > t_{1i} \mid a, b_i, r_i = 0, T > t_{0i}) = \exp\left(-b_i\,(t_{1i}^{a} - t_{0i}^{a})\right) \quad \text{for } t_{1i} > t_{0i}.$$

Parameter $b_i$ varies person by person based on the covariate measurements

$$\begin{aligned}
\log(b_i) = {} & \gamma_0 + \gamma_1 x_{3i} + \gamma_2 Y_i + \gamma_3 x_{3i} Y_i \\
& + \gamma_{43} A_{3i} + \gamma_{44} A_{4i} + \gamma_{45} A_{5i} + \gamma_{46} A_{6i} \\
& + \gamma_{53} x_{3i} A_{3i} + \gamma_{54} x_{3i} A_{4i} + \gamma_{55} x_{3i} A_{5i} + \gamma_{56} x_{3i} A_{6i} \\
& + \gamma_{62} D_{2i} + \gamma_{63} D_{3i} + \gamma_{64} D_{4i} + \gamma_{65} D_{5i} + \gamma_{66} D_{6i} \\
& + \gamma_{72} x_{3i} D_{2i} + \gamma_{73} x_{3i} D_{3i} + \gamma_{74} x_{3i} D_{4i} + \gamma_{75} x_{3i} D_{5i} + \gamma_{76} x_{3i} D_{6i},
\end{aligned} \tag{3}$$

where parameter $\gamma_0$ corresponds to the lung disease risk of non-smoking men at baseline (year 1972, North Karelia), $\gamma_1$ indicates the difference of risks for non-smoking men and women, $\gamma_2$ indicates the effect of smoking for men at baseline and $\gamma_3$ describes how the disease risk for smoking women differs from the risk of smoking men (at baseline). The $\gamma_{43}, \ldots, \gamma_{46}$ stand for how the other areas differ from the baseline area (North Karelia) for men. The coefficients $\gamma_{53}, \ldots, \gamma_{56}$ describe how the last-mentioned quantities differ between women and men. The $\gamma_{62}, \ldots, \gamma_{66}$ are the differences of the study year to the baseline study (year 1972) for men, and $\gamma_{72}, \ldots, \gamma_{76}$ are the differences of women and men for that particular study year. In Equation (3) the variables $A_{2i}, \ldots, A_{6i}$ are indicators for the study area such that $A_{2i} = 1$ for North Karelia (area 2), $A_{3i} = 1$ for Northern Savonia (area 3), $A_{4i} = 1$ for Turku and Loimaa (area 4), $A_{5i} = 1$ for Helsinki and Vantaa (area 5), and $A_{6i} = 1$ for Oulu province (area 6). Similarly, $D_{1i}, \ldots, D_{6i}$ are indicators for the study year such that $D_{1i} = 1$ for 1972, $D_{2i} = 1$ for 1977, $D_{3i} = 1$ for 1982, $D_{4i} = 1$ for 1987, $D_{5i} = 1$ for 1992, and $D_{6i} = 1$ for 1997.
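The likelihood contributions of the survival sub-model can be written compactly on the log scale. The following is a minimal sketch, not the JAGS model used in the analysis: the function name and argument names are hypothetical, and the parametrization $S(t) = \exp(-b t^{a})$ is the one implied by the survival function for censored cases.

```python
import math


def weibull_loglik(t1, t0, event, a, log_b):
    """Log-likelihood contribution of one person, left-truncated at age t0.

    t1: age at diagnosis (event=True) or at censoring (event=False)
    t0: age at the beginning of follow-up
    a: Weibull shape parameter; log_b: log of the person-specific scale b_i
    """
    if not t1 > t0 > 0:
        raise ValueError("need t1 > t0 > 0")
    b = math.exp(log_b)
    # log S(t1) - log S(t0), where S(t) = exp(-b * t**a)
    log_surv_ratio = -b * (t1 ** a - t0 ** a)
    if event:
        # add the log hazard, log(a * b * t1**(a - 1))
        return math.log(a) + log_b + (a - 1.0) * math.log(t1) + log_surv_ratio
    return log_surv_ratio


# Illustration with the posterior means of Table 4 (Appendix A) for a man in
# North Karelia in 1972: log(b_i) = gamma_0 + gamma_2 * Y_i.
a_hat, gamma_0, gamma_2 = 4.257, -21.848, 1.772
loglik_nonsmoker = weibull_loglik(65.0, 40.0, True, a_hat, gamma_0)
loglik_smoker = weibull_loglik(65.0, 40.0, True, a_hat, gamma_0 + gamma_2)
```

Plugging in posterior means is only a convenience for the example; in the Bayesian analysis the likelihood is evaluated for every posterior draw of the parameters.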

The smoking indicator is modelled using logistic regression. The effects of gender $x_{3i}$, year of birth $x_{\mathrm{birth},i} = x_{4i} - x_{1i}$ and study year $x_{4i}$ are included in the model. We assume that the smoking indicator is Bernoulli distributed

$$Y_i \sim \mathrm{Bernoulli}(s_i)$$

with probability $s_i$ such that

$$\mathrm{logit}(s_i) = \alpha_{0,a,u,g} + \alpha_{1,a,u,g}\,(x_{\mathrm{birth},i} - 1930), \tag{4}$$

where $a = x_{2i}$ is area, $g = x_{3i}$ is gender and $u = x_{4i}$ is study year for person $i$. The coefficient $\alpha_{0,a,u,g}$ represents the intercept term for persons living in area $a$, of gender $g$, who were born in 1930 and were selected to the sample in year $u$. The year 1930 was chosen as a reference level because all the studies have some participants who were born in 1930. The coefficient $\alpha_{1,a,u,g}$ represents the impact of year of birth on the probability of smoking.
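As a small worked example of Equation (4), the smoking probability can be computed by inverting the logit. The function name is hypothetical, and the coefficients are the posterior means reported in Table 3 (Appendix A) for men in North Karelia in 1972, used here as plug-in values only.

```python
import math


def smoking_probability(alpha0, alpha1, birth_year, reference_year=1930):
    """Inverse logit of the linear predictor in Equation (4)."""
    linear_predictor = alpha0 + alpha1 * (birth_year - reference_year)
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Men in North Karelia, study year 1972 (posterior means from Table 3).
p_born_1930 = smoking_probability(0.086, 0.001, 1930)
p_born_1940 = smoking_probability(0.086, 0.001, 1940)
print(round(p_born_1930, 3), round(p_born_1940, 3))
```

Because the slope for this group is close to zero, the two birth cohorts get almost the same probability; for women the slopes in Table 3 are larger, so the cohort effect is more pronounced.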

The information on the area (North Karelia or Northern Savonia) is missing for non-participants (2,664 in total) in 1972 and 1977. This missingness is due to accidentally lost data. These values are imputed using multiple imputation with fixed probabilities P(area was Northern Savonia | 1972) = 0.495 and P(area was Northern Savonia | 1977) = 0.493. The imputation is not necessary for model fitting purposes, but is needed for the comparison of the areawise smoking trends.


3.3 Model fitting and model diagnostics 


The model was built and fitted using JAGS (Plummer, 2003), which is a tool for Bayesian analysis (Gelman et al., 2013) of graphical models using Markov chain Monte Carlo (MCMC) (Robert & Casella, 2004). For all parameters the prior distributions were set as normal distributions with zero mean and variance $\sigma^2 = 1{,}000$. Relative to the scale of the parameters, these priors are non-informative. Eight chains were run in parallel. Each of the chains had 200,000 iterations, of which the first 40,000 were discarded as burn-in. From the remaining 160,000 iterations, the values of every 250th iteration were stored to produce eight final thinned chains of 640 iterations. In total we have $640 \times 8 = 5{,}120$ realizations from the posterior to use.
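The chain bookkeeping can be checked with a few lines of arithmetic. The variable names below are hypothetical; the numbers are the ones stated in the text.

```python
n_chains = 8
n_iterations = 200_000
n_burn_in = 40_000
thinning_interval = 250

# Iterations kept per chain after discarding burn-in and thinning.
kept_per_chain = (n_iterations - n_burn_in) // thinning_interval
# Posterior realizations pooled over all chains.
total_draws = kept_per_chain * n_chains
print(kept_per_chain, total_draws)
```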

The MCMC convergence was monitored by the Brooks-Gelman-Rubin convergence diagnostic (Brooks & Gelman, 1998). The diagnostic values of all parameters were below 1.01, where values below 1.05 are taken to indicate convergence. One of the MCMC chains of the final model is visualized for two parameters in Figure 4. The figure shows that the Weibull shape parameter is less well mixed than the other parameter. This is due to large autocorrelation caused by dependency on the Weibull scale parameter. The better mixing of the smoking coefficient $\gamma_2$ is also visualized in the figure. The majority of the parameters have good mixing. Posterior summaries of regression coefficients are given in Table 3 and Table 4, see Appendix A.
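The idea behind the convergence diagnostic can be sketched as follows. This is the basic potential scale reduction factor, which compares between-chain and within-chain variability; it omits the corrections of Brooks & Gelman (1998) and is not claimed to be the exact statistic computed in the analysis. The function name and the simulated chains are hypothetical.

```python
import random
import statistics


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equally long chains of one parameter."""
    n_chains = len(chains)
    n_draws = len(chains[0])
    if n_chains < 2 or n_draws < 2 or any(len(c) != n_draws for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: mean of the within-chain sample variances
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B / n: sample variance of the chain means
    between_over_n = statistics.variance(chain_means)
    pooled = (n_draws - 1) / n_draws * within + between_over_n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Eight well-mixed chains of 640 draws, and eight chains stuck at different levels.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(640)] for _ in range(8)]
stuck = [[rng.gauss(5.0 * k, 1.0) for _ in range(640)] for k in range(8)]
r_hat_mixed = potential_scale_reduction(mixed)
r_hat_stuck = potential_scale_reduction(stuck)
```

For the well-mixed chains the statistic is close to 1, whereas chains centred at different levels give a value far above the 1.05 threshold.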

The model diagnostics included a graphical comparison of the posterior predictive distribution against the observed values. The model was concluded to have a good fit to the data.


4 Comparison of corrected and uncorrected smoking trends

To obtain knowledge about the smoking prevalence for the study populations, we apply data augmentation (Tanner & Wong, 1987) to impute the missing values of smoking for non-participants, and take into account the censoring of $T_i$. Because we apply Bayesian inference, the imputations are drawn from the posterior predictive distribution. First, the posterior samples of the regression coefficients are obtained using MCMC and the participant data. Imputations of the smoking indicator for non-participants are then drawn using the following procedure, which we implemented in R (R Core Team, 2014). The imputation depends on whether the event is observed or censored. If $T_i$ is censored, then first an event time $\tilde{T}_i$ for $T_i$ is generated using

$$\tilde{T}_i \sim p(T_i \mid X_i) = p(T_i \mid X_i, Y_i = 1)\,p(Y_i = 1 \mid X_i) + p(T_i \mid X_i, Y_i = 0)\,p(Y_i = 0 \mid X_i).$$


[Figure 4 plot: two chain plots; horizontal axis: iteration (0 to 600).]

Figure 4: Chain plots of MCMC computation. Left: Weibull shape parameter $a$. Right: regression coefficient $\gamma_2$ of the smoking variable in the survival model.


After that, the imputed event time $\tilde{T}_i$ is used to simulate $Y_i \sim P(Y_i \mid \tilde{T}_i, X_i)$. If $T_i$ is observed, then $Y_i \sim P(Y_i \mid T_i, X_i)$ is simulated directly based on the observed event time.
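The imputation step for a single non-participant can be sketched as below. This is a simplified illustration rather than the R implementation used in the analysis: it plugs in posterior means from Table 4 (men, North Karelia, 1972) instead of looping over posterior draws, and it assumes that for a censored person the event time is drawn from the mixture restricted to ages above the censoring age. All function names are hypothetical.

```python
import math
import random

# Plug-in values (posterior means from Table 4), for illustration only.
SHAPE_A = 4.257
B_NONSMOKER = math.exp(-21.848)
B_SMOKER = math.exp(-21.848 + 1.772)


def weibull_density(t, a, b):
    """Density of the Weibull model with survival S(t) = exp(-b * t**a)."""
    return a * b * t ** (a - 1.0) * math.exp(-b * t ** a)


def prob_smoker_given_time(t, p_smoker, a=SHAPE_A):
    """P(Y = 1 | T = t, X) by Bayes' rule, with P(Y = 1 | X) = p_smoker."""
    smoker = weibull_density(t, a, B_SMOKER) * p_smoker
    nonsmoker = weibull_density(t, a, B_NONSMOKER) * (1.0 - p_smoker)
    return smoker / (smoker + nonsmoker)


def draw_time_after(c, a, b, rng):
    """Draw T conditional on T > c by inverting S(t) / S(c)."""
    return (c ** a + rng.expovariate(1.0) / b) ** (1.0 / a)


def impute_smoking(t, censored, p_smoker, rng, a=SHAPE_A):
    """Impute the smoking indicator for one person with follow-up time t."""
    if censored:
        # Assumption of this sketch: draw the event time from the
        # two-component mixture restricted to T > t (the censoring age).
        surv_smoker = math.exp(-B_SMOKER * t ** a) * p_smoker
        surv_nonsmoker = math.exp(-B_NONSMOKER * t ** a) * (1.0 - p_smoker)
        weight = surv_smoker / (surv_smoker + surv_nonsmoker)
        b = B_SMOKER if rng.random() < weight else B_NONSMOKER
        t = draw_time_after(t, a, b, rng)
    return int(rng.random() < prob_smoker_given_time(t, p_smoker, a))


rng = random.Random(2014)
imputed = [impute_smoking(62.0, False, 0.3, rng) for _ in range(1000)]
print(sum(imputed) / len(imputed))
```

With these plug-in values, an observed diagnosis at age 62 moves the probability of smoking well above the assumed prior value of 0.3, which is the mechanism through which the follow-up data inform the imputation.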

After the imputation, the survey sampling design has to be taken into account. We may treat the data with each imputation as a full dataset. To provide area-specific population-level estimates we may then utilize inverse sampling probability weights (Lehtonen & Pahkinen, 2004). In addition to utilizing the sampling weights, the estimates were adjusted using WHO Scandinavian standardization weights (Ahmad et al.) in order to make the smoking rates internationally comparable. As an outcome, we obtain area-specific trend estimates for both genders corresponding to each imputation. These trends can be considered as samples from the posterior distribution of the trends. The estimated model-based corrected trends are compared to the corresponding original trends in Figure 5 for North Karelia. The original or uncorrected trends were produced from the participant data only. The adjustment for sampling design and the WHO weights was the same as for the model-based trends.

In Figure 5, the difference between the trends increases as the participation rate decreases. In addition, it seems that the difference of the trends at most time points is larger for women than for men. On the other hand, the largest difference between the corrected and uncorrected prevalence estimates is 6.6 percentage points (relative difference of 25%), which is observed for men in Helsinki and Vantaa in 1997. The comparison of the model-based and original smoking prevalence trends for the study year 1997 is presented in Table 2.
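The reported difference for men in Helsinki and Vantaa in 1997 can be reproduced from the Table 2 values. The sketch assumes that the relative difference is taken relative to the participant-based estimate, which matches the reported 25%; the variable names are hypothetical.

```python
participant_pct = 26.1   # participant smoking (%), Table 2
model_based_pct = 32.7   # model-based total smoking (%), Table 2

# Absolute difference in percentage points.
absolute_difference = model_based_pct - participant_pct
# Relative difference with respect to the participant-based estimate.
relative_difference = absolute_difference / participant_pct
print(round(absolute_difference, 1), round(100 * relative_difference))
```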


Table 2: Observed and model-based smoking proportions for the study in 1997, adjusted using WHO Scandinavian standardization weights. The two rightmost columns describe the 95% credible intervals of the model-based trends. Participant smoking is the same as "Original trend" in Figure 5.

Gender  Area                  Participant   Model-based         95% credible interval
                              smoking (%)   total smoking (%)   lower    upper
Men     North Karelia         26.8          31.6                29.2     33.9
Men     Northern Savonia      30.7          31.8                29.5     34.0
Men     Turku and Loimaa      32.4          33.7                31.1     36.2
Men     Helsinki and Vantaa   26.1          32.7                30.1     35.6
Men     Oulu province         30.1          32.3                29.5     35.2
Women   North Karelia         14.2          18.3                16.3     20.7
Women   Northern Savonia      17.0          19.1                17.1     21.1
Women   Turku and Loimaa      20.5          23.6                21.3     26.0
Women   Helsinki and Vantaa   22.6          27.7                25.3     30.4
Women   Oulu province         19.1          22.2                20.0     24.3

[Figure 5 plot: two panels, Men and Women; horizontal axis: Year (Participation).]

Figure 5: Model-based trend and original trend for men (left) and women (right) in North Karelia province. North Karelia was chosen because it shows the most visible change in the trends among the areas. The two dotted lines represent the 95% credible interval of the posterior distribution for the corrected trends. Both the model-based and the original trend use WHO Scandinavian standardization weights.


5 Discussion

We have proposed an approach to overcome the challenges with non-ignorable missing data in epidemiological studies and have applied it to estimate the population trends of smoking in Finland in 1972-1997. The approach uses follow-up data to obtain information on risk factors missing at baseline. Thanks to the administrative registers in Finland, the follow-up data are available also for non-participants. Smoking has been selected as the risk factor of interest because it is a strong risk factor of lung cancer and COPD and potentially has an effect on the decision to participate in the HES.

We evaluated the proportion of smokers by combining the available information from both the participants and non-participants of the FINRISK study. Our results indicate that the estimated levels of smoking prevalence change when the information provided by lung cancer and COPD time-to-event data is used to estimate the smoking of non-participants.

In general, statistical modelling under non-ignorable missingness requires external information on the missingness mechanism. It can be argued that the inclusion of follow-up data provides the information needed. The situation can be formally described using causal models with design and then modelled by a Bayesian model. The idea of utilizing existing causal knowledge to fix non-ignorable missingness is not restricted to survival models.

The approach is limited by the availability of follow-up data. It takes years or decades until the follow-up data on lung cancer and COPD can be used to model the missing data mechanism. It is unclear to what extent the approach can be applied in other countries because register-based baseline and follow-up data sets are not usually available for non-participants. Even when the approach is not directly applicable in a study, the results from other similar studies where the approach has been applied may provide a starting point for the prior setting and the sensitivity analyses.

Censoring was treated as non-informative, which may cause some bias in the estimates. As smoking is a risk factor for many common causes of death, an individual censored due to death from other causes is more likely to be a smoker than an individual censored due to the end of the follow-up. It is therefore expected that the actual proportions of smokers could be even higher than the corrected proportions reported here. Improved estimation would require a competing risk approach with a comprehensive set of risk factors and a number of disease-specific endpoints. This is left as future work.

Inclusion of information about smoking as a time-dependent process would yield more realistic expressions of smoking in different age groups. The number of smoking years could then be considered as a covariate for the lung diseases. With the current model, it is assumed that an observed lung disease diagnosis, e.g. at age 50, is an equally strong indication of smoking regardless of whether the person is diagnosed five or 25 years after the survey. In reality, individuals may have started or stopped smoking after the survey was conducted.

The presented approach may be utilized with data arising in forthcoming FINRISK 
surveys. In addition, the model could be used to give recommendations on the sample 
size and the stratification. 

Our work is a reminder that an MNAR situation may be converted to MAR by using additional assumptions and external information. This allows us to provide estimates that describe the whole population instead of the restricted sample of survey participants.
Appendix A: Regression coefficients


Table 3: Posterior summaries of the estimated parameters for the smoking model, reduced to the parameters of North Karelia.

Description of related variable                       Parameter                Mean     SD      2.5%      97.5%
men born in 1930 in 1972                              $\alpha_{0,1972,1,2}$    0.086    0.043   0.003     0.168
men born in 1930 in 1977                              $\alpha_{0,1977,1,2}$    -0.294   0.047   -0.384    -0.203
men born in 1930 in 1982                              $\alpha_{0,1982,1,2}$    -0.672   0.069   -0.806    -0.538
men born in 1930 in 1987                              $\alpha_{0,1987,1,2}$    -0.888   0.084   -1.052    -0.724
men born in 1930 in 1992                              $\alpha_{0,1992,1,2}$    -1.092   0.159   -1.409    -0.779
men born in 1930 in 1997                              $\alpha_{0,1997,1,2}$    -1.449   0.107   -1.661    -1.244
women born in 1930 in 1972                            $\alpha_{0,1972,2,2}$    -2.106   0.070   -2.242    -1.971
women born in 1930 in 1977                            $\alpha_{0,1977,2,2}$    -2.452   0.086   -2.625    -2.287
women born in 1930 in 1982                            $\alpha_{0,1982,2,2}$    -2.361   0.111   -2.583    -2.153
women born in 1930 in 1987                            $\alpha_{0,1987,2,2}$    -2.412   0.128   -2.670    -2.164
women born in 1930 in 1992                            $\alpha_{0,1992,2,2}$    -2.599   0.219   -3.039    -2.183
women born in 1930 in 1997                            $\alpha_{0,1997,2,2}$    -2.833   0.204   -3.239    -2.449
difference of year of birth to 1930 (men in 1972)     $\alpha_{1,1972,1,2}$    0.001    0.004   -0.007    0.009
difference of year of birth to 1930 (men in 1977)     $\alpha_{1,1977,1,2}$    0.008    0.005   -0.0004   0.017
difference of year of birth to 1930 (men in 1982)     $\alpha_{1,1982,1,2}$    0.013    0.005   0.003     0.022
difference of year of birth to 1930 (men in 1987)     $\alpha_{1,1987,1,2}$    0.019    0.005   0.009     0.028
difference of year of birth to 1930 (men in 1992)     $\alpha_{1,1992,1,2}$    0.017    0.007   0.002     0.032
difference of year of birth to 1930 (men in 1997)     $\alpha_{1,1997,1,2}$    0.029    0.005   0.019     0.038
difference of year of birth to 1930 (women in 1972)   $\alpha_{1,1972,2,2}$    0.043    0.007   0.030     0.056
difference of year of birth to 1930 (women in 1977)   $\alpha_{1,1977,2,2}$    0.050    0.008   0.034     0.066
difference of year of birth to 1930 (women in 1982)   $\alpha_{1,1982,2,2}$    0.057    0.007   0.044     0.070
difference of year of birth to 1930 (women in 1987)   $\alpha_{1,1987,2,2}$    0.049    0.006   0.037     0.062
difference of year of birth to 1930 (women in 1992)   $\alpha_{1,1992,2,2}$    0.049    0.009   0.031     0.066
difference of year of birth to 1930 (women in 1997)   $\alpha_{1,1997,2,2}$    0.049    0.007   0.035     0.064
Table 4: Posterior summaries of the estimated parameters for the survival model (includes all parameters).

Description of related variable                 Parameter        Mean      SD      2.5%      97.5%
Weibull shape parameter                         $a$              4.257     0.111   4.041     4.475
intercept (men)                                 $\gamma_0$       -21.848   0.501   -22.817   -20.859
gender (women)                                  $\gamma_1$       -1.352    0.164   -1.665    -1.031
smoking                                         $\gamma_2$       1.772     0.061   1.653     1.893
interaction of smoking and gender               $\gamma_3$       0.559     0.116   0.328     0.786
Northern Savonia                                $\gamma_{43}$    0.070     0.054   -0.036    0.175
Turku and Loimaa                                $\gamma_{44}$    -0.298    0.085   -0.466    -0.134
Helsinki and Vantaa                             $\gamma_{45}$    -0.389    0.131   -0.652    -0.139
Oulu province                                   $\gamma_{46}$    -1.290    0.300   -1.931    -0.743
interaction of Northern Savonia and women       $\gamma_{53}$    -0.274    0.131   -0.528    -0.023
interaction of Turku and Loimaa and women       $\gamma_{54}$    0.654     0.161   0.337     0.970
interaction of Helsinki and Vantaa and women    $\gamma_{55}$    0.509     0.241   0.037     0.963
interaction of Oulu province and women          $\gamma_{56}$    1.072     0.477   0.104     2.006
year 1977                                       $\gamma_{62}$    -0.242    0.074   -0.382    -0.094
year 1982                                       $\gamma_{63}$    0.017     0.074   -0.125    0.164
year 1987                                       $\gamma_{64}$    -0.090    0.105   -0.295    0.117
year 1992                                       $\gamma_{65}$    -0.185    0.126   -0.433    0.062
year 1997                                       $\gamma_{66}$    0.134     0.107   -0.075    0.345
interaction of women and year 1977              $\gamma_{72}$    0.269     0.162   -0.042    0.582
interaction of women and year 1982              $\gamma_{73}$    -0.240    0.174   -0.581    0.092
interaction of women and year 1987              $\gamma_{74}$    0.182     0.212   -0.234    0.600
interaction of women and year 1992              $\gamma_{75}$    0.506     0.225   0.058     0.950
interaction of women and year 1997              $\gamma_{76}$    0.343     0.212   -0.078    0.753